ABSTRACT 



The homogeneity test provided by L. Hedges (1982) in 
meta-analysis has been widely used, mainly to test whether the effect sizes 
share a common population effect size. Ignoring the intercorrelations among 
effect sizes affects the Type I error rate. The main purpose of this research 
was to study the impact of pooling effect sizes on the homogeneity test in 
effect size analyses. Simulations were conducted to study the effects of 
pooling effect sizes on the Type I error rate and power of the Q test for 
varying sample size (N), number of studies (k), and proportion of pooled 
effect sizes (p) in the k studies. Simulation results show that composite 
meta-analysis tends to have a smaller Type I error rate than typical 
meta-analysis. The difference in Type I error rate between typical and 
composite meta-analysis is relatively large when the sample size or number of 
studies is large. Composite meta-analysis always has a greater Type II error 
rate and less power than typical meta-analysis. These results mean that more 
caution is necessary when pooling effect sizes. 
Meta-analytic techniques for synthesizing related studies have been widely used in the 
social sciences. One common step before estimating a mean effect size is to test whether the 
effect sizes share a common population effect size. If they do not, sensitivity analyses are 
conducted to examine the influence of particular studies on the combined effect size estimate. 
The homogeneity test provided by Hedges (1982) has been widely used for this purpose. 

One typical feature of meta-analyses is treating multiple outcomes from single samples 
as if they were independent when calculating a grand mean effect size. Ignoring the 
intercorrelations among effect sizes affects the Type I error rate (Raudenbush, Becker, & 
Kalaian, 1988). The author’s latest research showed that typical meta-analyses tended to yield 
more significant results in homogeneity tests and in categorical and regression analyses than 
analyses that controlled for dependent effect sizes (Kim, 1999). However, when the correlation 
among dependent effect sizes is very low, or when it is not clear whether the effect sizes are 
dependent, simply combining or pooling effect sizes might cause problems. 

As mentioned above, in meta-analyses the effect-size analysis can involve two levels of 
statistical tests. The two types of decision-making errors (Type I and Type II) in the 
first-stage homogeneity test can also affect the second-stage test of the magnitude of the 
common effect size. The main purpose of this research is to study the impact of pooling effect 
sizes on the homogeneity test in effect size analyses. 



1 Paper presented at the annual meeting of the American Educational Research Association, New 
Orleans, April, 2000. 

Nonindependence Issue 

Landman and Dawes (1982) cautioned about five sources of nonindependence in 
meta-analysis: first, multiple measures of outcomes from the same subjects within single 
studies; second, measures taken at multiple points in time from the same subjects (i.e., multiple 
occasions); third, nonindependence of scores within a single outcome measure; fourth, 
nonindependence of studies within a single article; and fifth, nonindependent samples across 
articles (pp. 506-507). The second through fifth cases can be controlled by careful 
decision-making. In the second case, when the same tests are administered several times in a 
study, only the last occasion could be selected (e.g., Kulik, Kulik, & Cohen, 1979). The third 
case happens when a study reports both a global index and a more specific index that is part of 
the global index. In this case, choosing the specific index is ideal if it allows the study of 
interesting moderator variables. The fourth case occurs when samples from two different 
experiments in a study overlap or are the same. The decision rule used for the third case may be 
applied here to arrive at independent samples. The last case may happen if the same 
sample appears in two different articles. In this case the more informative article should be 
selected. All four ways of treating nonindependence are used in this synthesis. The first 
case, however, cannot be controlled by such decisions and requires statistical treatment. 

One common approach to this situation is for the meta-analyst to use all the statistics 
available in a particular study to calculate one mean effect size (Tracz, Elmore, & Pohlmann, 
1992). The typical analysis tends to treat each effect size from a given study as independent of 
the other effect sizes from the same study (e.g., Smith, Glass, & Miller, 1980). However, Glass, 
McGaw, and Smith (1981) recognized that “the data set to be analyzed [in a meta-analysis] will 
invariably contain complicated patterns of statistical dependence [since] each study is likely to 
yield more than one finding” (p. 200). Bangert-Drowns (1986) stated, “multiple effect sizes 
from any one study cannot be regarded as independent and should not be used with statistical 
tests that assume their independence” (p. 397). In the same article (p. 392), he discussed the 
“Inflated Ns” problem: a report can have a greater influence on the meta-analytic findings if it 
uses many dependent measures. The “Inflated Ns” problem threatens the generalizability or 
external validity of a meta-analysis. Another problem is inflated Type I error (Raudenbush et 
al., 1988). Strube (1983) stated the general rule that failure to adjust for nonindependence 
inflates the Type I error rate at the meta-analysis level. 

Researchers have devised several methods for combining dependent data in 
meta-analysis. One strategy for reducing dependence is to select, on some predetermined basis, 
a single dependent measure to represent each study (Cooper, 1979). However, the question of 
which indicator is best among several dependent variables is often ambiguous, and such a 
decision is very difficult to make. A common strategy for dealing with studies that use multiple 
outcomes has been to average them. This makes sense for providing a representative effect size 
estimate when the outcomes are parallel measures of a single construct (Raudenbush et al., 1988). 
Instead of the mean, the median effect size is a more conservative option. Raudenbush et al. 
(1988) refer to this approach as study effect meta-analysis (p. 393) because it treats the study 
as the unit of analysis. 
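
The averaging strategy described above can be illustrated with a minimal sketch. The effect 
sizes and study labels below are hypothetical and are not taken from any study cited here; the 
helper name study_level_effects is likewise introduced only for illustration. 

```python
from statistics import mean, median

def study_level_effects(effects_by_study, summary=mean):
    """Collapse each study's effect sizes into one representative value."""
    return {study: summary(values) for study, values in effects_by_study.items()}

# Hypothetical data: study "A" reports three outcomes from a single sample,
# study "B" reports one outcome, and study "C" reports two.
effects = {"A": [0.20, 0.50, 0.80], "B": [0.10], "C": [0.40, 0.60]}

mean_effects = study_level_effects(effects)              # averaging within study
median_effects = study_level_effects(effects, median)    # median as the alternative

for study in effects:
    print(study, round(mean_effects[study], 3), round(median_effects[study], 3))
```

Either summary leaves one effect size per study, so the study becomes the unit of analysis. 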

A statistical solution to this nonindependence problem within a study was 
developed by Rosenthal and Rubin (1986). When the study has a large sample size and the 
intercorrelations between outcome measures differ little, they suggest computing a 
composite effect size. Gleser and Olkin (1994) also showed how to calculate composite effect 
sizes within studies by using all individual intercorrelations among outcome variables. One 
difference between these two procedures is that Rosenthal and Rubin (1986) use a “typical” 
correlation, which is a correlation representative of all intercorrelations between the multiple 
measures. 

One common feature of the above approaches is that they calculate a representative effect size 
for dependent effect sizes. Combining dependent effect sizes from the same sample into one 
representative value seems reasonable. However, if the dependence is not 
certain, then simply combining or pooling effect sizes may cause problems for the Type I error 
rate and power of statistical tests in meta-analysis, mainly because of the reduced sample sizes. 



Q statistics 



The biased effect size for each study is computed by 

$$ g_i = \frac{\bar{X}_{iE} - \bar{X}_{iC}}{S_i} $$ (1) 

where $\bar{X}_{iE}$ and $\bar{X}_{iC}$ are the means of the two compared groups in the ith 
study and $S_i$ is the pooled standard deviation for study i, calculated as: 

$$ S_i = \sqrt{\frac{(n_{iE} - 1) S_{iE}^2 + (n_{iC} - 1) S_{iC}^2}{n_{iE} + n_{iC} - 2}} $$ (2) 

where $n_{iE}$ and $n_{iC}$ are the sample sizes of the two compared groups and $S_{iE}$ and 
$S_{iC}$ are their standard deviations. The unbiased effect size (corrected for small sample 
bias) is calculated as 

$$ d_i = \left(1 - \frac{3}{4 n_i - 9}\right) g_i $$ (3) 

where $n_i = n_{iE} + n_{iC}$, with the conditional variance 

$$ v_i = \frac{n_i}{n_{iE} n_{iC}} + \frac{d_i^2}{2 n_i} $$ (4) 

(Hedges & Olkin, 1985). The sample size varies across studies. Since estimates from larger 
studies are more precise than estimates from smaller studies, larger studies are given more 
weight when the effects are pooled. The weight $w_i = 1/v_i$ is used. A pooled effect, or 
weighted mean effect ($T_\cdot$), can be calculated as: 

$$ T_\cdot = \frac{\sum_{i=1}^{k} w_i d_i}{\sum_{i=1}^{k} w_i} $$ (5) 

with a variance 

$$ v_\cdot = \frac{1}{\sum_{i=1}^{k} w_i} $$ (6) 

(Shadish & Haddock, 1994, p. 265). In order to determine whether the studies can reasonably be 
described as sharing a common effect size, the following statistical test for homogeneity of 
effect size was performed (Hedges & Olkin, 1985): 

$$ Q = \sum_{i=1}^{k} w_i (d_i - T_\cdot)^2 $$ (7) 

The test statistic Q has an asymptotic chi-square distribution with k-1 degrees of freedom. 
When Q is greater than the critical value with k-1 degrees of freedom, it is 
concluded that the synthesis has heterogeneous data. 
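
The computation of the Q test can be sketched as follows. This is an illustration only: it 
assumes the standard Hedges and Olkin (1985) forms for the bias-corrected effect size, its 
conditional variance, the inverse-variance weighted mean, and Q. The summary statistics are 
hypothetical, and the function names hedges_d and homogeneity_q are introduced here for 
illustration. 

```python
import math

def hedges_d(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Bias-corrected effect size d_i and its conditional variance v_i."""
    pooled_sd = math.sqrt(
        ((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2)
    )
    g = (mean_e - mean_c) / pooled_sd        # biased effect size
    n = n_e + n_c
    d = (1 - 3 / (4 * n - 9)) * g            # small-sample bias correction
    v = n / (n_e * n_c) + d ** 2 / (2 * n)   # conditional variance of d
    return d, v

def homogeneity_q(ds, vs):
    """Weighted mean effect, its variance, and the homogeneity statistic Q."""
    ws = [1.0 / v for v in vs]               # inverse-variance weights
    t_dot = sum(w * d for w, d in zip(ws, ds)) / sum(ws)
    q = sum(w * (d - t_dot) ** 2 for w, d in zip(ws, ds))
    return t_dot, 1.0 / sum(ws), q

# Hypothetical summary statistics: (mean_e, mean_c, sd_e, sd_c, n_e, n_c).
studies = [
    (10.5, 10.0, 1.0, 1.0, 30, 30),
    (10.9, 10.0, 1.1, 0.9, 30, 30),
    (10.1, 10.0, 1.0, 1.2, 30, 30),
    (10.4, 10.0, 0.8, 1.0, 30, 30),
    (10.6, 10.0, 1.0, 1.0, 30, 30),
]
effects = [hedges_d(*s) for s in studies]
ds = [d for d, _ in effects]
vs = [v for _, v in effects]
t_dot, v_dot, q = homogeneity_q(ds, vs)

CRITICAL_05_DF4 = 9.488  # upper 5% point of chi-square with k - 1 = 4 df
heterogeneous = q > CRITICAL_05_DF4
print(f"T. = {t_dot:.3f}, Q = {q:.3f}, reject homogeneity: {heterogeneous}")
```

The critical value is hard-coded for k = 5 to keep the sketch free of external libraries; a 
chi-square quantile function would be used in practice. 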
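If the parameter effect sizes $\delta_i$ are not all equal, then Q has a noncentral chi-square 
distribution with k-1 degrees of freedom and noncentrality parameter (Chang, 1992) 

$$ \lambda = \sum_{i=1}^{k} \frac{(\delta_i - \bar{\delta}_\cdot)^2}{\sigma_i^2} $$ 

where $\sigma_i^2$ is the variance of the ith effect size and $\bar{\delta}_\cdot$ is the 
weighted mean of the $\delta_i$ with weights $1/\sigma_i^2$. 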
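
As a rough illustration of how a delta pattern translates into a noncentrality parameter, the 
sketch below assumes the form in which each squared deviation of a parameter effect size from 
the weighted mean is divided by that effect size's variance, and it assumes the variance is 
given by the large-sample formula for d evaluated at the parameter value. The delta pattern and 
sample sizes are hypothetical. 

```python
def noncentrality(deltas, n_e, n_c):
    """Noncentrality parameter of Q for a pattern of parameter effect sizes.

    Assumes equal group sizes across studies and the large-sample variance
    n / (n_e * n_c) + delta^2 / (2 * n) for each effect size.
    """
    n = n_e + n_c
    variances = [n / (n_e * n_c) + delta ** 2 / (2 * n) for delta in deltas]
    weights = [1.0 / v for v in variances]
    delta_bar = sum(w * d for w, d in zip(weights, deltas)) / sum(weights)
    return sum(w * (d - delta_bar) ** 2 for w, d in zip(weights, deltas))

# Hypothetical pattern: four studies with delta = 0 and one with delta = .5.
pattern = [0.0, 0.0, 0.0, 0.0, 0.5]
for group_n in (10, 30, 300):
    lam = noncentrality(pattern, group_n, group_n)
    print(f"n = {group_n}-{group_n}: lambda = {lam:.3f}")
```

Under these assumptions the parameter grows with the group sample size, which is consistent with 
power approaching 1 for the largest samples. 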



Methods 

To answer the main question, “What are the effects of pooling effect sizes on the Type I error 
rate and power of the Q test for varying study sample size (N), number of studies (k), and 
proportion of pooled effect sizes (p) in the k studies?”, the following procedures were 
implemented. 

Simulation factors 

The factors and their values reflect those of Harwell (1995), Chang (1992), and Hedges 
and Olkin (1985). Scores were assumed to be normally distributed. The numbers of effect sizes 
modeled were k = 5, 10, and 30, with group sample sizes of 10-10, 30-30, and an extreme value 
of 300-300. The extreme sample size of 300 was included to examine its particular effect on the 
Type I error rate; in practice, many primary studies have samples larger than 300. 
Unequal sample sizes were not included, but three different proportions of 
pooled effect sizes were considered: 20, 40, and 60% of all effect sizes. For 
instance, when the 40% proportion was used with 5 studies, 2 effect sizes were assumed to be 
correlated (i.e., to come from the same sample of one primary study). Thus, the number of 
pooled effect sizes varied with the number of studies. These proportions of pooled effect sizes 
and numbers of studies were based on the author’s last study (Kim, 1999). Pooled effect sizes 
were assumed to come from one, two, or three samples (primary studies) for k = 5, 10 (except 
the 20% proportion case), and 30, respectively. The noncentrality patterns, along with the other 
factors, are shown in Figure 1. Only one nonzero value of δ was used, chosen for its simplicity 
and medium size in light of Chang (1992) and Harwell (1995). 
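
The bookkeeping behind the pooling proportions can be sketched as follows. The sketch assumes 
that the number of pooled effect sizes is the proportion times k and that the effect sizes from 
each contributing sample are replaced by a single mean, so the number of effect sizes after 
pooling is k minus the number pooled plus the number of samples. The function name composite_k 
is hypothetical, and the exact cell layout of Figure 1 is not reproduced here. 

```python
def composite_k(k, proportion, n_samples):
    """Number of effect sizes left after pooling (an assumed bookkeeping rule)."""
    n_pooled = round(k * proportion)      # effect sizes treated as dependent
    return k - n_pooled + n_samples       # each sample contributes one mean

# Example from the text: 40% of k = 5 means 2 effect sizes from one sample.
print(composite_k(5, 0.40, 1))

# Illustrative cells only; the number of samples per cell is an assumption.
for k, n_samples in ((5, 1), (10, 2), (30, 3)):
    for proportion in (0.40, 0.60):
        print(k, proportion, composite_k(k, proportion, n_samples))
```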



Insert Figure 1 about here 

Data generation 

The data were generated using MATLAB 5.3. The procedures were as follows. 
(A) Sample effect sizes were obtained from noncentral t statistics, computed from generated 
normal deviates and chi-square random numbers. 
(B) Constants equal to the specified delta values were added to create the desired 
noncentrality pattern, following Chang (1992). The steps used for parts (A) and (B) were as 
follows: (a) obtain a normal deviate ($Z_i$); (b) multiply $Z_i$ by 
$\sqrt{N_i / (n_{iE} \times n_{iC})}$, where $N_i$ is the total sample size of study i; (c) add 
the noncentrality value $\delta_i$; and (d) divide the result by $\sqrt{Ch_i / df}$, where 
$Ch_i$ is a chi-square random number (Chang, 1992). 
(C) For composite meta-analysis, the mean effect size for each specified group of pooled effect 
sizes was computed before computing the Q statistic. 
(D) The Q statistic was computed using equation (7) for both the typical and the composite sets 
of effect sizes. 
(E) Steps (A) to (D) were repeated 2000 times [the same number of replications employed by 
Harwell (1996), Chang (1992), and Hedges and Olkin (1985)] for each combination of simulation 
factors. 
The proportions of significant Q tests across the 2000 replications represent empirical Type I 
error rates and power values; those from typical and composite meta-analysis were compared to 
see the effect of pooling effect sizes on the Q test. Overall, a 3 (proportions of pooled effect 
sizes; 2 when k = 5) × 3 (numbers of studies) × 3 (sample sizes) × 4 or 5 (sets of δ values) 
design was replicated. 
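
Steps (A) through (E) can be sketched in Python rather than MATLAB. This is not the original 
program. It assumes that the generated value is corrected for small-sample bias, that the 
variance of a composite effect size is computed from the usual formula as if it were a single 
effect size, and that the first n_pooled effect sizes form one pooled group. The critical values 
are the standard upper 5% points of the chi-square distribution. 

```python
import math
import random

# Upper 5% critical values of chi-square for df = 1..9 (standard table values).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
                6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919}

def simulate_d(delta, n_e, n_c, rng):
    """One effect size from a noncentral t construction, steps (a) to (d)."""
    df = n_e + n_c - 2
    z = rng.gauss(0.0, 1.0)                        # (a) normal deviate
    num = z * math.sqrt((n_e + n_c) / (n_e * n_c)) + delta   # (b), (c)
    chi2 = rng.gammavariate(df / 2.0, 2.0)         # chi-square with df degrees of freedom
    g = num / math.sqrt(chi2 / df)                 # (d)
    return (1 - 3 / (4 * (n_e + n_c) - 9)) * g     # bias correction (an assumption here)

def variance(d, n_e, n_c):
    """Conditional variance of an effect size."""
    n = n_e + n_c
    return n / (n_e * n_c) + d * d / (2 * n)

def q_stat(ds, vs):
    """Homogeneity statistic Q with inverse-variance weights."""
    ws = [1.0 / v for v in vs]
    t_dot = sum(w * d for w, d in zip(ws, ds)) / sum(ws)
    return sum(w * (d - t_dot) ** 2 for w, d in zip(ws, ds))

def rejection_rates(deltas, n_pooled, n_e, n_c, reps=2000, seed=1):
    """Empirical rejection rates of Q for typical and composite analyses."""
    rng = random.Random(seed)
    typical = composite = 0
    for _ in range(reps):
        ds = [simulate_d(delta, n_e, n_c, rng) for delta in deltas]
        # Typical analysis: every effect size is treated as independent.
        if q_stat(ds, [variance(d, n_e, n_c) for d in ds]) > CHI2_CRIT_05[len(ds) - 1]:
            typical += 1
        # Composite analysis: the first n_pooled effect sizes become one mean.
        pooled = sum(ds[:n_pooled]) / n_pooled
        ds_c = [pooled] + ds[n_pooled:]
        if q_stat(ds_c, [variance(d, n_e, n_c) for d in ds_c]) > CHI2_CRIT_05[len(ds_c) - 1]:
            composite += 1
    return typical / reps, composite / reps

# Null pattern with k = 5 and 40% pooling (2 effect sizes), groups of 30-30.
typ_rate, comp_rate = rejection_rates([0.0] * 5, n_pooled=2, n_e=30, n_c=30)
print(f"typical: {typ_rate:.3f}, composite: {comp_rate:.3f}")
```

With an all-zero delta pattern the two rates estimate Type I error; replacing some zeros with a 
nonzero δ turns the same call into a power estimate. 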

Results 



Adequacy of the simulation 

One piece of evidence for the adequacy of the simulation is the mean effect sizes across the 
conditions studied. Table 1 shows the mean effect sizes from the 2000 replications. 



Insert Table 1 about here 



For instance, all values in the first pattern of each of the three proportions are very close to 
zero, which indicates the adequacy of the simulation. Further evidence is that the Type I error 
rates and power values for typical meta-analysis are similar to the theoretical rates. The 
Type I error rates are close to the nominal 5% level, and delta patterns 2 and 5 (4 when k = 10) 
in the 60% proportion have values very close to the theoretical values (see Table 2). 

Insert Table 2 about here 



Type I error rates of the Q test 

The first row for each proportion gives empirical Type I error rates, since that delta pattern 
does not include any nonzero value of δ. All of the Type I error rates are very close to .05 for 
the typical case when sample sizes are large (30 and 300). Comparing typical and 
composite meta-analysis, composite meta-analysis always has smaller Type I error rates than 
the typical case. This finding implies that pooling effect sizes makes the test too conservative 
in rejecting the null hypothesis. One particular feature is that the difference in Type I error 
rates between the two approaches increases as the proportion of pooled effect sizes increases. 
This implies that the more pooled effect sizes a meta-analysis contains, the more conservative 
the analysis is in rejecting the null hypothesis. Figure 2 presents this tendency clearly. Most 
cases show an increasing Type I error rate difference (typical minus composite meta-analysis) 
across the proportions of pooled effect sizes. 
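
When judging how close an empirical rate is to .05, the Monte Carlo sampling error of a 
proportion based on 2000 replications may be a useful yardstick. The sketch below uses the 
ordinary binomial standard error; it is offered as an aid to reading the results and was not 
part of the analysis reported here. 

```python
import math

def mc_standard_error(rate, reps=2000):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / reps)

nominal = 0.05
se = mc_standard_error(nominal)
lower, upper = nominal - 2 * se, nominal + 2 * se
print(f"SE = {se:.4f}; approximate 2-SE band: {lower:.3f} to {upper:.3f}")
```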



Insert Figure 2 about here 



Power of the Q test 

When the delta pattern contains nonzero values, the rejection rates estimate the power of the Q 
test. These values show patterns similar to those of the Type I error rates. Because most power 
values are close to 1 when the sample size is 300, those cases are not discussed further. Power 
was largest for a given delta pattern when n (fixing k) and k (fixing n) were larger. This 
tendency agrees with Chang (1992) and Harwell (1995), even though the values differ slightly. 
Power for typical meta-analysis is always greater than for composite meta-analysis. This means 
that composite meta-analysis has less statistical power than the test should have. In other 
words, when independent effect sizes are treated as dependent, inference based on the Q 
statistic is subject to a higher Type II error rate and lower statistical power. Factors 
affecting the differences are shown in Figure 3. 



Insert Figure 3 about here 

When the sample size, the number of studies, and the number of δ values of .5 among the pooled 
effect sizes are larger, the power difference (typical minus composite meta-analysis) is 
larger. This indicates that under these conditions composite meta-analysis is more likely to 
inflate the Type II error rate and reduce the power of the Q test. 



Conclusions 

Composite meta-analysis seems to have a smaller Type I error rate than typical meta-analysis. 
The difference in Type I error rate between typical and composite meta-analysis is relatively 
large when the sample size and/or number of studies is large. This finding can be interpreted as 
follows. Composite meta-analysis is too conservative in rejecting the null hypothesis of the 
homogeneity test, so analysts tend to retain the null hypothesis when the alternative hypothesis 
is true. In turn, analysts tend to use, for example, a fixed effect model to test whether the 
grand mean effect size is significant. They then tend to reject the null hypothesis of the grand 
mean effect size test even when it is true, because the standard error used is smaller than it 
should be. As a result, analysts are more likely to commit a Type I error in the test of the 
grand mean effect size. 

Harwell (1995) noted Chang’s (1992) suggestion that meta-analysts should be 
concerned about Type II errors, since an under-powered Q test would lead to an 
unacceptably high probability of wrongly concluding that the model fits the data (p. 2). This 
study showed that composite meta-analysis always has a greater Type II error rate and less power 
than typical meta-analysis. The difference in power between typical and composite 
meta-analysis is relatively large when the sample size, the number of studies, and the number of 
δ values of .5 among the pooled effect sizes are large. Thus more caution is necessary when 
pooling (or combining) effect sizes. 

In the future, a study might examine the impact of dependence among effect sizes on the Type I 
error rate and power of the homogeneity test when the dependence is ignored. Because 
the correlations among the dependent effect sizes would have to be taken into account, 
Harwell’s (1995) data generation procedures and Gleser and Olkin’s (1994) formulas would be 
needed to generate dependent effect sizes. In addition, it would be useful to examine the effect 
of dependence on the homogeneity test under the random effects model. 
Table 1. Means of simulated effect sizes 

Table 2. Type I error rate and power rate for the Q test 

Figure 1. Pattern of pooling effect sizes with values of deltas 

* pd: proportion of pooling effect sizes in a meta-analysis 
** delta: patterns of deltas 
*** k: number of studies 
k composite: number of studies after pooling effect sizes 

Figure 2. Type I error difference between typical and composite meta-analysis 
(vertical axis: Type I error rate difference) 

Figure 3. Power difference between typical and composite meta-analysis 
(legend: No. of studies = 30, 10, 5) 



Abstract 

One typical feature of meta-analyses is treating multiple outcomes from single samples as 
if they were independent when calculating a grand mean effect size. Ignoring the intercorrelations 
among effect sizes affects the Type I error rate. However, when the correlation among dependent 
effect sizes is very low, or when it is not clear whether the effect sizes are dependent, simply 
combining or pooling effect sizes might cause problems. The main purpose of this research 
is to study the impact of pooling effect sizes on the homogeneity test in effect size analyses. To 
answer the main question, “What are the effects of pooling effect sizes on the Type I error rate 
and power of the Q test with 3 sample sizes, 3 numbers of studies, 3 proportions of pooled effect 
sizes in the k studies, and 4 or 5 kinds of noncentrality patterns?”, 2000 replications were 
carried out. 

Composite meta-analysis seems to have a smaller Type I error rate than typical meta-analysis. 
The difference in Type I error rate between typical and composite meta-analysis is relatively 
large when the sample size and/or number of studies is large. This finding implies that composite 
meta-analysis is too conservative in rejecting the null hypothesis of the homogeneity test and, 
in turn, is more likely to have a higher Type I error rate and lower power in the test of the 
grand mean effect size. This study also showed that composite meta-analysis always has a greater 
Type II error rate and less power than typical meta-analysis. The difference in power between 
typical and composite meta-analysis is relatively large when the sample size, the number of 
studies, and the number of δ values of .5 among the pooled effect sizes are large. These results 
suggest that more caution is necessary when pooling (or combining) effect sizes. 